Generated: /root/verbalize-sampling
Experiments: 4
Metrics: diversity, ngram, joke_quality
| Name | Task | Method | Model | Responses | Temperature |
|---|---|---|---|---|---|
| direct (samples=1) | joke | direct | anthropic/claude-4-sonnet | 30 | 0.7 |
| sequence [strict] (samples=5) | joke | sequence | anthropic/claude-4-sonnet | 30 | 0.7 |
| multi_turn [strict] (samples=5) | joke | multi_turn | anthropic/claude-4-sonnet | 30 | 0.7 |
| structure_with_prob [strict] (samples=5) | joke | structure_with_prob | anthropic/claude-4-sonnet | 30 | 0.7 |
Tell me a joke.
Why don't scientists trust atoms? Because they make up everything!
Tell me a joke.
Why don't scientists trust atoms? Because they make up everything!
Tell me a joke.
Why don't scientists trust atoms? Because they make up everything!
Tell me a joke.
Why don't scientists trust atoms? Because they make up everything!
Tell me a joke.
Why don't scientists trust atoms? Because they make up everything!
Tell me a joke.
Why don't scientists trust atoms? Because they make up everything!
Tell me a joke.
Why don't scientists trust atoms? Because they make up everything!
Tell me a joke.
Why did the scarecrow win an award? Because he was outstanding in his field!
Tell me a joke.
Why don't skeletons fight each other? They don't have the guts!
Tell me a joke.
Why don't scientists trust atoms? Because they make up everything!
Tell me a joke.
Why don't scientists trust atoms? Because they make up everything!
Tell me a joke.
Why don't scientists trust atoms? Because they make up everything!
| Experiment | Average Diversity | Min Diversity | Max Diversity | Std Diversity | Average Response Length | Average Unique Words | Average Vocabulary Richness | Total Cost |
|---|---|---|---|---|---|---|---|---|
| direct (samples=1) | 0.0327 | 0.0000 | 0.2404 | 0.0724 | 13.2267 | 13.1667 | 0.9965 | 0.0000 |
| sequence [strict] (samples=5) | 0.2992 | 0.0000 | 0.4896 | 0.1345 | 14.5467 | 14.0467 | 0.9754 | 0.0001 |
| multi_turn [strict] (samples=5) | 0.2472 | 0.0000 | 0.4759 | 0.1306 | 14.9600 | 14.5733 | 0.9835 | 0.0001 |
| structure_with_prob [strict] (samples=5) | 0.2973 | 0.0000 | 0.4958 | 0.1415 | 13.3400 | 12.8800 | 0.9747 | 0.0001 |
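The report does not define "Average Vocabulary Richness". One common reading is the type-token ratio (unique words divided by total words), which is roughly consistent with the "Average Unique Words" and "Average Response Length" columns above. The sketch below illustrates that reading only; the function name `vocabulary_richness` and the tokenization rule are assumptions, not the tool's actual implementation.

```python
import re

def vocabulary_richness(text: str) -> float:
    """Type-token ratio: unique words / total words (assumed definition)."""
    # Assumed tokenization: lowercase runs of letters, digits, and apostrophes.
    words = re.findall(r"[a-z0-9']+", text.lower())
    return len(set(words)) / len(words) if words else 0.0

joke = "Why don't scientists trust atoms? Because they make up everything!"
print(vocabulary_richness(joke))  # every word in this joke is distinct
```

Short one-liner jokes rarely repeat a word, which would explain why every experiment scores close to 1 on this column.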
| Experiment | Average ROUGE-L | Min ROUGE-L | Max ROUGE-L | Std ROUGE-L | Average Response Length |
|---|---|---|---|---|---|
| direct (samples=1) | 0.8692 | 0.1333 | 1.0000 | 0.2677 | 13.2267 |
| sequence [strict] (samples=5) | 0.1817 | 0.0000 | 1.0000 | 0.3051 | 14.5467 |
| multi_turn [strict] (samples=5) | 0.2943 | 0.0000 | 1.0000 | 0.3139 | 14.9600 |
| structure_with_prob [strict] (samples=5) | 0.1902 | 0.0000 | 1.0000 | 0.3362 | 13.3400 |
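ROUGE-L scores two texts by their longest common subsequence (LCS) of tokens, so higher values here mean the responses overlap more, i.e. are less diverse. The report does not say which variant (precision, recall, or F-measure) or which tokenizer was used; the sketch below assumes the F1 form over lowercase whitespace tokens, and the helper names are hypothetical. Its values will not necessarily match the table.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1 over lowercase whitespace tokens (assumed variant)."""
    c, r = candidate.lower().split(), reference.lower().split()
    if not c or not r:
        return 0.0
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)

atoms = "Why don't scientists trust atoms? Because they make up everything!"
scarecrow = "Why did the scarecrow win an award? Because he was outstanding in his field!"
print(rouge_l_f1(atoms, atoms))      # identical responses
print(rouge_l_f1(atoms, scarecrow))  # shared tokens: "why", "because"
```

Under this reading, a set of responses dominated by one repeated joke yields pairwise scores near 1, which is in line with the high average for the direct method.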
| Experiment | Funniness | Cleverness | Originality | Structure | Overall | Normalized Overall |
|---|---|---|---|---|---|---|
| direct (samples=1) | 3.7933 | 3.8800 | 2.5800 | 4.2667 | 3.6433 | 0.7287 |
| sequence [strict] (samples=5) | 3.5733 | 3.8867 | 2.5800 | 4.1800 | 3.5440 | 0.7088 |
| multi_turn [strict] (samples=5) | 3.3933 | 3.7800 | 2.5733 | 4.1333 | 3.4370 | 0.6874 |
| structure_with_prob [strict] (samples=5) | 3.4733 | 3.8000 | 2.4067 | 4.2133 | 3.4527 | 0.6905 |
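The "Normalized Overall" values above are consistent with dividing "Overall" by 5, which suggests a 5-point rating scale. The scale maximum is an inference from the numbers, not something the report states. A quick check under that assumption:

```python
MAX_SCORE = 5.0  # assumed top of the rating scale

# "Overall" values copied from the table above.
overall = {
    "direct": 3.6433,
    "sequence": 3.5440,
    "multi_turn": 3.4370,
    "structure_with_prob": 3.4527,
}

normalized = {name: round(score / MAX_SCORE, 4) for name, score in overall.items()}
for name, value in normalized.items():
    print(f"{name}: {value:.4f}")
```

Each result matches the corresponding "Normalized Overall" entry to four decimal places.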